
Neural Information Processing Systems

Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is greater than that of computing the full gradient itself, which we call the chicken-and-egg loop. As a result, the seemingly faster convergence per iteration in reality leads to slower convergence in wall-clock time.
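The loop described above can be made concrete with a small sketch (not the paper's method, just a generic importance-sampling illustration in NumPy): sampling examples from an adaptive distribution gives an unbiased gradient estimate, but refreshing that distribution already requires touching every example, the same O(n) cost as the full gradient.

```python
import numpy as np

def importance_sampled_grad(grads, probs, rng):
    """Unbiased gradient estimate: draw index i ~ probs, reweight by 1/(n*p_i)."""
    n = len(probs)
    i = rng.choice(n, p=probs)
    return grads[i] / (n * probs[i])

rng = np.random.default_rng(0)
# Toy per-example gradients for a 3-parameter model.
grads = rng.normal(size=(100, 3))
full_grad = grads.mean(axis=0)

# "Adaptive" distribution proportional to gradient norms -- but computing it
# already requires a pass over all n examples, i.e. full-gradient-level cost:
norms = np.linalg.norm(grads, axis=1)
probs = norms / norms.sum()

# The estimator is unbiased: averaging many draws recovers the full gradient.
est = np.mean([importance_sampled_grad(grads, probs, rng) for _ in range(20000)],
              axis=0)
```

The paper's point is precisely that avoiding this O(n) refresh (e.g. via hashing) is what breaks the loop; the sketch above only illustrates why a naive adaptive distribution is self-defeating.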




Reviews: Linear Convergence with Condition Number Independent Access of Full Gradients

Neural Information Processing Systems

The optimal first-order algorithm of Nesterov has linear convergence for such problems, but the constant depends on the square root of the condition number k. The authors consider the situation where one has access to the expensive full gradient of the objective as well as a cheap stochastic gradient oracle. They propose a hybrid algorithm that requires only O(log 1/eps) calls to the full gradient oracle (independent of the condition number) and O(k^2 log 1/eps) calls to the cheaper stochastic gradient oracle -- as long as the condition number is not too big, this could be faster in theory. The main idea behind their algorithm (called Epoch Mixed Gradient Descent, or EMGD) is to replace a full gradient step (called an epoch) with a fixed number O(k^2) of mixed gradient steps, which use a combination of the full gradient (computed once per epoch) and stochastic gradients (which vary within an epoch). By taking the average of the O(k^2) iterates within an epoch, they can show a constant decrease in suboptimality *independent* of the condition number, which is why the number of required full gradient computations (the number of epochs) is independent of the condition number. They provide a simple and complete self-contained proof of their convergence rate, but no experiments.



Linear Convergence with Condition Number Independent Access of Full Gradients

Neural Information Processing Systems

For smooth and strongly convex optimization, the optimal iteration complexity of gradient-based algorithms is $O(\sqrt{\kappa}\log 1/\epsilon)$, where $\kappa$ is the condition number. When the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to reduce the number of full gradients required by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize both kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full gradient and the stochastic gradient to update the intermediate solutions. By performing a fixed number of mixed gradient descent steps, we are able to improve the sub-optimality of the solution by a constant factor, and thus achieve a linear convergence rate. Theoretical analysis shows that EMGD is able to find an $\epsilon$-optimal solution by computing $O(\log 1/\epsilon)$ full gradients and $O(\kappa^2\log 1/\epsilon)$ stochastic gradients.
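A minimal NumPy sketch of this epoch structure, written here in an SVRG-style form of the mixed gradient (the exact EMGD update, including its domain constraint and step-size schedule, differs in detail; the least-squares problem, data, and step size are illustrative assumptions):

```python
import numpy as np

def mixed_gradient_epochs(A, b, epochs=50, inner=200, lr=0.02, seed=0):
    """Epochs of 'mixed' gradient steps for 0.5/n * ||Ax - b||^2: one full
    gradient per epoch at an anchor point, corrected inside the epoch by
    cheap per-example stochastic gradients; the next anchor is the average
    of the inner iterates."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    anchor = np.zeros(d)
    for _ in range(epochs):
        full = A.T @ (A @ anchor - b) / n      # expensive: one pass over the data
        x = anchor.copy()
        running = np.zeros(d)
        for _ in range(inner):
            i = rng.integers(n)                # cheap: one random example
            g = full + A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ anchor - b[i])
            x -= lr * g
            running += x
        anchor = running / inner               # epoch averaging
    return anchor

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
x_true = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
b = A @ x_true                                 # consistent system: optimum is x_true
x_hat = mixed_gradient_epochs(A, b)
```

The key accounting matches the abstract: full gradients are computed once per epoch, while the O(condition-number-dependent) work is pushed onto the cheap stochastic steps.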


VAMO: Efficient Zeroth-Order Variance Reduction for SGD with Faster Convergence

Chen, Jiahe, Ma, Ziye

arXiv.org Artificial Intelligence

Optimizing large-scale nonconvex problems, common in deep learning, demands balancing rapid convergence with computational efficiency. First-order (FO) optimizers, which serve as today's baselines, provide fast convergence and good generalization but often incur high computation and memory costs due to the large size of modern models. Conversely, zeroth-order (ZO) algorithms reduce this burden using estimated gradients, yet their slow convergence in high-dimensional settings limits practicality. We introduce VAMO (VAriance-reduced Mixed-gradient Optimizer), a stochastic variance-reduced method that extends mini-batch SGD with full-batch ZO gradients under an SVRG-style framework. VAMO's hybrid design utilizes a two-point ZO estimator to achieve a dimension-agnostic convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the number of iterations and $b$ is the batch size, surpassing the dimension-dependent slowdown of purely ZO methods and significantly improving over SGD's $\mathcal{O}(1/\sqrt{T})$ rate. Additionally, we propose a multi-point variant that mitigates the $\mathcal{O}(1/b)$ error by adjusting the number of estimation points to balance convergence and cost. Importantly, VAMO achieves these gains with smaller dynamic memory requirements than many FO baselines, making it particularly attractive for edge deployment. Experiments including traditional neural network training and LLM finetuning confirm that VAMO not only outperforms established FO and ZO methods, but also does so with a light memory footprint.
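The two-point estimator at the core of such ZO methods can be sketched directly (a generic Gaussian-smoothing estimator, not VAMO itself; the smoothing radius `mu` and direction count are illustrative choices):

```python
import numpy as np

def zo_gradient(f, x, num_dirs, mu=1e-4, rng=None):
    """Two-point zeroth-order gradient estimate via Gaussian smoothing:
    average of (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over random
    directions u ~ N(0, I). Needs only function values, no derivatives."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

# Sanity check against the analytic gradient of f(x) = ||x||^2, i.e. 2x.
f = lambda x: float(np.dot(x, x))
x = np.array([1.0, -2.0, 0.5])
g_zo = zo_gradient(f, x, num_dirs=5000)
g_true = 2 * x
```

Each extra direction costs two function evaluations; the multi-point variant mentioned in the abstract trades exactly this evaluation budget against estimator error.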


Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling

Medyakov, Daniil, Molodtsov, Gleb, Chezhegov, Savelii, Rebrikov, Alexey, Beznosikov, Aleksandr

arXiv.org Artificial Intelligence

In today's world, machine learning is hard to imagine without large training datasets and models. This has led to the use of stochastic methods for training, such as stochastic gradient descent (SGD). SGD provides weak theoretical guarantees of convergence, but there are modifications, such as Stochastic Variance Reduced Gradient (SVRG) and StochAstic Recursive grAdient algoritHm (SARAH), that can reduce the variance. These methods require occasional computation of the full gradient, which can be time-consuming. In this paper, we explore variants of variance reduction algorithms that eliminate the need for full gradient computations. To make our approach memory-efficient and avoid full gradient computations, we use two key techniques: the shuffling heuristic and the idea behind the SAG/SAGA methods. As a result, we improve existing estimates for variance reduction algorithms without full gradient computations. Additionally, for non-convex objective functions, our estimate matches that of classic shuffling methods, while for strongly convex ones it is an improvement. We conduct a comprehensive theoretical analysis and provide extensive experimental results to validate the efficiency and practicality of our methods for large-scale machine learning problems.
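The SAG/SAGA idea the authors build on can be sketched in a few lines (a plain SAG-style update with an in-epoch shuffle, not the authors' algorithm; the step size and least-squares data are illustrative assumptions):

```python
import numpy as np

def sag_shuffled(A, b, epochs=150, lr=0.003, seed=0):
    """SAG-style minimization of 0.5/n * ||Ax - b||^2 without any full-gradient
    pass: store the last gradient seen for each example and step along the
    running average of that table, visiting examples in shuffled order."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    table = np.zeros((n, d))          # last stored per-example gradient
    avg = np.zeros(d)                 # average of the table, kept incrementally
    for _ in range(epochs):
        for i in rng.permutation(n):  # shuffling: each example once per epoch
            g_new = A[i] * (A[i] @ x - b[i])
            avg += (g_new - table[i]) / n   # O(d) update, no pass over the data
            table[i] = g_new
            x -= lr * avg
    return x

rng = np.random.default_rng(2)
A = rng.normal(size=(200, 5))
x_true = np.array([1.0, -0.5, 0.0, 0.5, 1.0])
b = A @ x_true
x_hat = sag_shuffled(A, b)
```

The trade-off the paper targets is visible here: the table costs O(n*d) memory in this naive form, which is what their memory-efficient combination of shuffling and SAG/SAGA ideas is designed to avoid.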


Smoothed Gradients for Stochastic Variational Inference

Stephan Mandt, David Blei

Neural Information Processing Systems

Stochastic variational inference (SVI) lets us scale up Bayesian computation to massive data. It uses stochastic optimization to fit a variational distribution, following easy-to-compute noisy natural gradients. As with most traditional stochastic optimization methods, SVI takes precautions to use unbiased stochastic gradients whose expectations are equal to the true gradients. In this paper, we explore the idea of following biased stochastic gradients in SVI. Our method replaces the natural gradient with a similarly constructed vector that uses a fixed-window moving average of some of its previous terms. We demonstrate the many advantages of this technique. First, its computational cost is the same as for SVI, and its storage requirements grow only by a constant factor. Second, it enjoys significant variance reduction over the unbiased estimates, has smaller bias than averaged gradients, and achieves smaller mean-squared error against the full gradient. We test our method on latent Dirichlet allocation with three large corpora.
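The fixed-window moving-average idea can be sketched on a toy problem (plain SGD on a quadratic rather than natural-gradient SVI; the window size, step size, and noise level are illustrative assumptions):

```python
import numpy as np
from collections import deque

def smoothed_sgd(grad_fn, x0, steps=2000, lr=0.05, window=10, seed=0):
    """SGD that follows a fixed-window moving average of the last `window`
    noisy gradients: variance drops by roughly 1/window, at the price of a
    small bias, since the averaged gradients were computed at older iterates."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    recent = deque(maxlen=window)     # oldest gradient falls out automatically
    for _ in range(steps):
        recent.append(grad_fn(x, rng))
        x -= lr * np.mean(recent, axis=0)
    return x

# Toy objective 0.5*||x - mu||^2 with heavy additive gradient noise.
mu = np.array([1.0, -3.0])
noisy_grad = lambda x, rng: (x - mu) + rng.normal(scale=2.0, size=2)
x_hat = smoothed_sgd(noisy_grad, x0=np.zeros(2))
```

As in the paper, the estimator is biased (the window mixes gradients from older iterates) but its lower variance lets the iterate settle much closer to the optimum than single-sample steps with the same step size.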


Reviews: Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond

Neural Information Processing Systems

After Rebuttal: Thank you for the responses. I believe that the paper will be even stronger with the inclusion of the stochastic-gradient variant. This is a very valuable theorem, which will be useful for other theoreticians working in this field. To the best of my knowledge, this is the first paper that uses a stochastic Runge-Kutta integrator for sampling from strongly log-concave densities with explicit guarantees. The authors further show that their proposed numerical scheme improves upon existing guarantees when applied to the overdamped Langevin dynamics.